Physics 474 - Spring 2023
Homework 2 - Fitting Data, Parameter Estimation and Confidence Interval
Author: Djamil Lakhdar-Hamina
In this homework we will practice fitting a function with parameters to some data. In addition, we will place some emphisis also on determining the confidence interval for the fit parameters.
skills we will excercise:
We will be investgating something called the Cosmic Microwave Background or CMB for short.
Quoted from Wikipedia:
"In Big Bang cosmology the cosmic microwave background (CMB, CMBR) is electromagnetic radiation that is a remnant from a primordial stage of the universe, also known as "relic radiation". The CMB is faint cosmic background radiation filling all space. It is an important source of data on the early universe because it is the oldest electromagnetic radiation in the universe, dating to the epoch of recombination when the first atoms were formed. With a standard optical telescope, the background space between stars and galaxies is almost completely dark. However, a sufficiently sensitive radio telescope detects a faint background glow that is almost uniform and is not associated with any star, galaxy, or other object. This glow is strongest in the microwave region of the radio spectrum. The accidental discovery of the CMB in 1965 by American radio astronomers Arno Penzias and Robert Wilson was the culmination of work initiated in the 1940s, and earned the discoverers the 1978 Nobel Prize in Physics."
We will be using measurements of CMB microwave intensity (W/m^2/sr/Hz) vs microwave frequency (GHz) compiled around 2000 in the source
Salvaterra and Burigana, 2000 arXiv:astro-ph/0206350
I have compiled the data into an easily read NumPy binary data file
"CMB_Intensity_Data.npz"
We will use this data to:
Background for the problem:
Black-Body Radiation
Every physical body (including the CMB surface) of temperature spontaneously and continuously emits electromagnetic radiation of radiance which describes the spectral emissive power per unit area, per unit solid angle, per unit frequency for particular radiation frequencies . The relationship is given by Planck's radiation law
where is Plank's constant, is Boltzman's constant and is the speed of light. ___________________________________________________________________
You are given data for measurments of vs frequency and are being asked to find the temperature that gives the best-fir between Plank's law and the data. You will then compare the best-fit to the data and estimate the confidence interval on the parameter . ___________________________________________________________________________
Part 1) (2 pts)
Read in the datafile 'CMB_Intensity_Data.npz' and print the "keys" in the file
e.g.
filename = 'CMB_Intensity_Data.npz'
Data = np.load(filename)
print(Data.files)
# Your code...
from math import exp
import numpy as np
import pandas as pd
import plotly.express as pex
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.optimize import curve_fit
import scipy.stats as st
## define constants
h=6.62607015e-34
k=1.380649e-23
c=299792458
best_cmb_temp=2.72548
## useful functions
planck_radiation_law=lambda f,T: (2*h/c**2)*(f**3/(np.exp(h*f/(k*T))-1))
def chi_squared(theory:np.array,data:np.array,sigma:np.array)->np.array:
"""
This function calculates TOTAL chi-squared between Theory and Data using sigma
as errors.
The 3 arrays must be of equal size.
Note: This is NOT reduced chi-squared
Usage:
inputs: theory = input hypothesis (or Theory)
data = Data points
sigma = uncertainty on data points
output: if arrays are of equal size returns the TOTAL chi-squared
if arrays are not of equal size returns -1.0
"""
if np.size(theory)==np.size(data) and np.size(data)==np.size(sigma):
return np.sum((theory-data)**2/sigma**2)
else:
print('error - arrays of unequal size')
return -1.
def delta_chi_squared(best_fit_parameter:float,
theory,
data_x:np.array,
data_y:np.array,
data_err:np.array,
chi2:float,
sigma_range:int,
sigma:float,
n_bins:int)->np.array:
"""
This function calculates Delta chi-squared between Theory and Data using sigma
as errors.
The 3 arrays must be of equal size.
Note: This is NOT reduced chi-squared
Usage:
inputs: theory = input hypothesis (or Theory)
data = Data points
sigma = uncertainty on data points
output: if arrays are of equal size returns the TOTAL chi-squared
if arrays are not of equal size returns -1.0
"""
binvals = np.linspace(best_fit_parameter - sigma_range*sigma, best_fit_temp + sigma_range*sigma,n_bins)
delta_chi_square = np.zeros(n_bins)
for i in range(n_bins):
delta_chi_square[i] = chi_squared(theory(data_x, binvals[i]),data_y,data_err) - chi2
return binvals,delta_chi_square
## import data
filename='CMB_Intensity_Data.npz'
cmb_intensity_data=np.load(f'data/{filename}')
cmb_intensity_data.files
['description', 'frequency', 'intensity', 'error']
Print the 'description'
e.g.
print(Data['description'])
## print description of data
cmb_intensity_data['description']
array(['-----------------------------------------------',
'This file contains data for Cosmic Microwave Background (CMB)',
'Data compiled from salvaterra and burigana, 2000 arXiv:astro-ph/0206350',
'Intensity measurements versus microwave frequency',
'The data is given as frequency (GHz)',
'CMB Intensity (W/m^2/sr/Hz) and error on Intensity',
'description = this text describing data',
'frequency = Frequency of mesurement in GHz',
'intensity = CMB Intensity (W/m^2/sr/Hz)',
'error = estimated experimental uncertainty of intensity in same units',
'NOTE:', 'Removed 1 data point f= 113.6 Ghz T=2.279 K',
'and used delta_T = 0.025K in place of 0.01K for last 40 points',
'--------------------------------------------------'], dtype='<U71')
Part 2) (3 pts)
Make a plot of the data with errorbars vs frequency [GHz]
# define plots for errorbars vrs frequency
fig = go.Figure(data=go.Scatter(
x=cmb_intensity_data['frequency'],
y=cmb_intensity_data['intensity'],
error_y=dict(
type='data', # value of error bar given in data coordinates
array=cmb_intensity_data['error'],
color='orange')
,mode='markers' ))
fig.update_layout(
title="Intensity vrs. Frequency: Spectral Density of Blackbody",
xaxis_title="Frequency (Ghz)",
yaxis_title="Intensity (W/m^2/sr/Hz)",
legend_title="error",
font=dict(
family="Courier New, monospace",
size=10,
color="RebeccaPurple"
),
xaxis = dict(
tickmode = 'array',
tickvals = [.1,1,10,100,1000],
),
yaxis = dict(
tickmode = 'array',
tickvals = [10**-22,10**-21,10**-20,10**-19,10**-18],
)
)
fig.update_xaxes(type="log")
fig.update_yaxes(type="log")
fig.show()
Part 3a) (3 pts)
fit the data above to Plank's law to get the best fit temperature
# predicting the cmb temperature via chi-square
T0 = 2.0
poptT, pcovT=curve_fit(f=planck_radiation_law,
xdata=cmb_intensity_data['frequency']*1e9, ## convert from Ghz to Hz
ydata=cmb_intensity_data['intensity'],
p0=T0,
sigma=cmb_intensity_data['error'],
absolute_sigma=True)
sigma=np.sqrt(np.diag(pcovT))[0]
print(f"Predicted temperature of the CMB : {poptT[0]}")
print(f"The 1-sigma error : {np.sqrt(np.diag(pcovT))[0]}")
Predicted temperature of the CMB : 2.727795133568652 The 1-sigma error : 0.003575932124145145
Part 3b) (2 pts)
Calculate and print the , reduced-, and probability
## calculating chi square , dof, and probability
chi2=chi_squared(planck_radiation_law(cmb_intensity_data['frequency']*1e9,*poptT),cmb_intensity_data['intensity'],cmb_intensity_data['error'])
dof = np.size(cmb_intensity_data['intensity'])-1 #degrees of freedom = data points - #parameters (here 1)
prob = st.chi2.sf(chi2,dof) #chi-squared probability from scipy function
print(f"chi-squared : {chi2}")
print(f"χ-mu : {chi2/dof}")
print(f"dof : {dof}")
print(f"prob : {prob}")
chi-squared : 117.93663675312395 χ-mu : 1.281919964707869 dof : 92 prob : 0.03551609242598948
Observations:
The best-fit temperature acheived from curve-fit gives a reduced chis-square of around 1.28, indicating that our theory plus the best fit parameter track the data quite well (since it is reasonably close to 1). The low chi-square probability means that around 3.5 % of the time we would get a higher chi-square simply by chance. Empirically, our best fit determines a temperature of around 2.73K which is in fact the approximate value quoted in the literature for the average temperature for the cosmic microwave background radiation. Chi-square minimization seems to be a good method for determining the temperature of the CMB. _______________________________________________________________________
Part 4) (5 pts)
Make a single figure with two subplots
# sort data for the fitted line
sorted_data=np.sort(cmb_intensity_data['frequency'])
fit_data=planck_radiation_law(cmb_intensity_data['frequency']*1e9,*poptT)
fractional_residuals=(cmb_intensity_data['intensity']-fit_data)/fit_data
fig = make_subplots(rows=2, cols=1)
## actual data in intesnity vrs frequency hz
fig.add_trace(
go.Scatter(
x=cmb_intensity_data['frequency'],
y=cmb_intensity_data['intensity'],
error_y=dict(
type='data', # value of error bar given in data coordinates
array=cmb_intensity_data['error'],
color='orange')
,
mode='markers',name='Frequency' ),row=1,col=1),
## best fit line
fig.add_trace(
go.Scatter(
x=sorted_data,
y=planck_radiation_law(sorted_data*1e9,*poptT),name='Best Fit: Planck Law'),row=1,col=1)
## error bar line
fig.add_trace(
go.Scatter(
x=cmb_intensity_data['frequency'],
y=fractional_residuals ,mode='markers',error_y=dict(type='data',array=cmb_intensity_data['error']/fit_data,color='orange'),name='Fractional Residuals'),row=2,col=1)
fig.update_layout(
title="Intensity vrs. Frequency: Planck's Law and the Spectral Density of Blackbody",
font=dict(
family="Courier New, monospace",
size=10,
color="RebeccaPurple"
)
,height=800,width=1000)
fig.update_xaxes(title_text="Frequency (GHz)", type="log",tickmode='array',tickvals=[.1,1,10,100,1000], row=1, col=1,showticklabels=True)
fig.update_yaxes(title_text="Intensity (W/m^2/sr/Hz)", tickmode='array',tickvals=[10**-22,10**-21,10**-20,10**-19,10**-18], type='log',row=1, col=1)
fig.update_xaxes(title_text="Frequency (GHz)", type="log",tickmode='array',tickvals=[.1,1,10,100,1000], row=2, col=1)
fig.update_yaxes(title_text="Fractional Residuals (Intensity)", row=2, col=1)
fig.show()
Observations:
The top-image suggests that our theory+best-fit parameter leads to a very good fit for the data, surprisingly good. However, the bottom-image does clarify that in fact there is some discrepancy between fit and data. It seems that errors are more so distributd in the low-frequency range than in the high. However, there are many more points in the higher-frequency range, so that overwhelmingly residuals are neaer the zero-line. That being said, the best-fit line intersects nearly every single point, and the relatively low fractional residuals, especially in higher-frequency range, are further evidence that the best fit parameter of around 2.73K is close to the true parameter. Furthermore, it would seem that our one-parameter theory , Planck's law, is well confirmed when given the proper parameter.
Part 5) (5 pts)
Now calculate the vs temperature for (based on return of curve_fit) around the best-fit temperature . That is:
make a plot
# sigma as determined by covariance matrix
sigma=np.sqrt(np.diag(pcovT))[0]
best_fit_temp=poptT[0]
chi_square_best_fit=chi2
n_bins = 10000
tempvals = np.linspace(best_fit_temp - 4*sigma, best_fit_temp + 4*sigma,n_bins)
delta_chi_square = np.zeros(n_bins)
for i in range(n_bins):
delta_chi_square[i] = chi_squared(planck_radiation_law(cmb_intensity_data['frequency']*1e9, tempvals[i]),cmb_intensity_data['intensity'],cmb_intensity_data['error']) - chi2
layout=dict(
title="$\Delta\chi(T)^{2} vrs. Temperature (Kelvin)$ ",
xaxis_title="Temperature (K)",
yaxis_title=r"$\Delta\chi(T)^{2}$",
font=dict(
family="Courier New, monospace",
size=10,
color="RebeccaPurple")
,height=800,width=1000 ,
yaxis = dict(
tickmode = 'array',
tickvals = list(range(0,20,5)),
),
xaxis = dict(
tickmode = 'array',
tickvals = [best_fit_temp +i*sigma for i in range(-4,5)],
ticktext = ["$T_{fit}-4\sigma$",
"$T_{fit}-3\sigma$","$T_{fit}-2\sigma$",
"$T_{fit}-\sigma$",
"$T_{fit}$",
"$T_{fit}+1\sigma$",
"$T_{fit}+2\sigma$",
"$T_{fit}+3\sigma$",
"$T_{fit}+4\sigma$"]
))
temp_delta_chi_trace = dict(
x = tempvals,
y = delta_chi_square,
mode = 'lines',
type = 'scatter',
line_shape='spline',
connectgaps = True,
name="$\Delta\chi(T)^{2}$",
showlegend=True)
best_fit_temp_trace=dict(
x=[best_fit_temp]*10,
y=list(range(0,18,2)),
mode = 'lines',
type = 'scatter',
line_shape='spline',
line_color='purple',
name='Best Fit Temperature (K)')
best_current_temp_trace=dict(
x=[best_cmb_temp]*10,
y=list(range(0,18,2)),
mode = 'lines',
type = 'scatter',
line_shape='spline',
line=dict(color='red', width=1, dash='dash'),
name="Current Best Measurement (K)")
data=[temp_delta_chi_trace,best_fit_temp_trace,best_current_temp_trace]
fig = go.Figure(data = data,layout=layout)
for num in [1,4,9]:
fig.add_hline(y=num,opacity=.4,line_color='red',annotation_text=f"$\Delta\chi(T)^{2}={num}$")
fig.show()
Summary and Conclusions:
(comment on)
For a model with one-degree of freedom, the horizontal lines , in intersection with the graph, represent lines which when projected on the x-axis, give confidence intervals. For the 68% confidence interval, at , the fit temperature is between 2.726696526942884 and 2.7338516562404718. For the 95% interval, the fit temperature is between 2.7231189622940897 and 2.737429220889266. The 1-line intersects the graph, and determies a region of x between two points of intersection. The projection of that line onto then x-axis represents that 68.3 % of the time the parameter, the CMB temperature, is within that region between and . The 4-line determines a region which when projected x-wise represents the 95.4 % confidence interval. 95.4% of the time the parameter lies within the region between and . Finally, the 9-line represents the determination of the region between and where we have 99.73 % confidence that the true parameter is within that x-region i.e. 99.73 % of the time our parameter is found in that region. For a model of one-degree of freedom, the lines 1,4,9 intersect at multiples of , determined from the covaraince matrix, at respectively 1,2,3 . We see a complete overlap of the 1,4,9 line and multples of the error. This overlap may be an artifact of the way the x-ticks were created, or have to do with the fine-grained nature of the binning i.e. the fact that the was computed for over 10000 different values. The "true" parameter, corresponding to the state of the art measurement of the CMB temperature, is not even from chi-square minimized value. Our procedure therefore determined a value with low error, or difference from the true value. It would seem that on the one hand not only have we determined a relatively close value to the CMB temperature but also that the Planck radiation law is a firm basis for making such a prediction.